System and method for analyzing remote traffic data in a distributed computing environment

ABSTRACT

A system, method and storage medium embodying computer-readable code for analyzing traffic data in a distributed computing environment are described. The distributed computing environment includes a plurality of interconnected systems operatively coupled to a server, a source of traffic data hits and one or more results tables categorized by an associated data type. Each results table includes a plurality of records. The server is configured to exchange data packets with each interconnected system. Each traffic data hit corresponds to a data packet exchanged between the server and one such interconnected system. Each traffic data hit is collected from the traffic data hits source as access information into one such record in at least one results table according to the data type associated with the one such results table. Each of the records in the results table corresponds to a different type of access information for the data type associated with the results table. The access information collected into the results tables during a time slice is summarized periodically into analysis results. The time slice corresponds to a discrete reporting period. The access information is analyzed from the results tables in the analysis results to form analysis summaries according to the data types associated with the results tables.

[0001] This application is a continuation of U.S. patent application Ser. No. 09/425,280, filed Oct. 21, 1999, which is a continuation of U.S. patent application Ser. No. 08/801,707, filed Feb. 14, 1997 (issued as U.S. Pat. No. 6,112,238, on Aug. 29, 2000).

BACKGROUND OF THE INVENTION

[0002] This invention relates generally to remote traffic data analysis and more particularly to a system and method for analyzing remote traffic data in a distributed computing environment.

[0003] The worldwide web (hereinafter “web”) is rapidly becoming one of the most important publishing mediums today. The reason is simple: web servers interconnected via the Internet provide access to a potentially worldwide audience with a minimal investment in time and resources in building a web site. The web server makes available for retrieval and posting a wide range of media in a variety of formats, including audio, video and traditional text and graphics. And the ease of creating a web site makes reaching this worldwide audience a reality for all types of users, from corporations, to startup companies, to organizations and individuals.

[0004] Unlike other forms of media, a web site is interactive and the web server can passively gather access information about each user by observing and logging the traffic data packets exchanged between the web server and the user. Important facts about the users can be determined directly or inferentially by analyzing the traffic data and the context of the “hit.” Moreover, traffic data collected over a period of time can yield statistical information, such as the number of users visiting the site each day, what countries, states or cities the users connect from, and the most active day or hour of the week. Such statistical information is useful in tailoring marketing or managerial strategies to better match the apparent needs of the audience.

[0005] To optimize use of this statistical information, web server traffic analysis must be timely. However, it is not unusual for a web server to process thousands of users daily. The resulting access information recorded by the web server amounts to megabytes of traffic data. Some web servers generate gigabytes of daily traffic data. Analyzing the traffic data for even a single day to identify trends or generate statistics is computationally intensive and time-consuming. Moreover, the processing time needed to analyze the traffic data for several days, weeks or months increases linearly as the time frame of interest increases.

[0006] The problem of performing efficient and timely traffic analysis is not unique to web servers. Rather, traffic data analysis is possible whenever traffic data is observable and can be recorded in a uniform manner, such as in a distributed database, client-server system or other remote access environment.

[0007] One prior art web server traffic analysis tool is described in “WebTrends Installation and User Guide,” version 2.2, October 1996, the disclosure of which is incorporated herein by reference. WebTrends is a trademark of e.g. Software, Portland, Oreg. However, this prior art analysis tool cannot perform ad hoc queries using a log-based archival of analysis summaries for efficient performance.

[0008] Other prior art web server traffic analysis tools are generally effective in handling modest volumes of server traffic data when operating on a small scale server or non-mainframe solution. Examples of these analysis tools include Market Focus licensed by Intersé Corporation, Hit List licensed by MarketWave and Net.Analysis licensed by Net.Genisys. However, these analysis tools require increasingly expensive and complex hardware systems to handle higher traffic data volumes. The latter approach is impracticable for the majority of web server operators. Moreover, these prior art analysis tools are also incapable of rapidly generating trend and statistical information on an ad hoc basis.

[0009] Therefore, there is a need for a system and method to efficiently process the voluminous amounts of access information generated by web servers in a timely, expedient manner without the attendant costs associated with large scale hardware requirements. Preferably, such a system and method could perform ad hoc queries of analysis summaries in a timely and accurate manner.

[0010] There is a further need for a system and method for efficiently analyzing traffic data reflecting access information on a web server operating in a distributed computing environment. Preferably, such a system and method would process traffic data presented from a variety of sources.

[0011] There is still a further need for a system and method for analyzing traffic data consisting of access information for predefined time slices.

SUMMARY OF THE INVENTION

[0012] The present invention comprises a system and method for analyzing remote traffic data in a distributed computing environment in a timely and accurate manner.

[0013] An embodiment of the present invention is a system, method and storage medium embodying computer-readable code for analyzing traffic data in a distributed computing environment. The distributed computing environment includes a plurality of interconnected systems operatively coupled to a server, a source of traffic data hits and one or more results tables categorized by an associated data type. Each results table includes a plurality of records. The server is configured to exchange data packets with each interconnected system. Each traffic data hit corresponds to a data packet exchanged between the server and one such interconnected system. Each traffic data hit is collected from the traffic data hits source as access information into one such record in at least one results table according to the data type associated with the one such results table. Each of the records in the results table corresponds to a different type of access information for the data type associated with the results table. The access information collected into the results tables during a time slice is summarized periodically into analysis results. The time slice corresponds to a discrete reporting period. The access information is analyzed from the results tables in the analysis results to form analysis summaries according to the data types associated with the results tables.

[0014] The foregoing and other features and advantages of the invention will become more readily apparent from the following detailed description of a preferred embodiment of the invention which proceeds with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 is a functional block diagram of a system for analyzing traffic data in a distributed computing environment according to the present invention.

[0016] FIG. 2 is a flow diagram of a method for analyzing traffic data in a distributed computing environment according to the present invention using the system of FIG. 1.

[0017] FIG. 3A shows a format used in storing a “hit” of traffic data received by the server of FIG. 1.

[0018] FIG. 3B shows, by way of example, a “hit” of formatted traffic data received by the server of FIG. 1.

[0019] FIG. 4 is a block diagram of the data structures used in storing access information determined from the traffic data hits of FIG. 3A.

[0020] FIG. 5 is a block diagram of a container file storing the access information in the analysis results of FIG. 1.

[0021] FIG. 6 is a flow diagram of a routine for collecting and summarizing access information used in the method of FIG. 2.

[0022] FIG. 7 is a flow diagram of a routine for storing access information used in the routine of FIG. 6.

[0023] FIG. 8 is a flow diagram of a routine for summarizing access information used in the routine of FIG. 6.

[0024] FIGS. 9A and 9B are a flow diagram of a one-pass routine for analyzing access information used in the method of FIG. 2.

[0025] FIG. 10 is a flow diagram of a two-pass routine for analyzing access information used in the method of FIG. 2.

[0026] FIG. 11 is a graph of the number of open sessions as a function of time received by the server of FIG. 1.

[0027] FIG. 12 is a flow diagram of steps for adjusting the collection of access information for inflation used in the routine of FIG. 6.

[0028] FIG. 13 is a flow diagram of steps for adjusting the analysis of access information for inflation used in the routines of FIGS. 9A and 9B and FIG. 10.

DETAILED DESCRIPTION

[0029] FIG. 1 is a functional block diagram of a system for analyzing traffic data in a distributed computing environment 9 according to the present invention. A server 10 provides web site and related services to remote users. By way of example, the remote users can access the server 10 from a remote computer system 12 interconnected with the server 10 over a network connection 13, such as the Internet or an intranetwork, a dial up (or point-to-point) connection 14 or a direct (dedicated) connection 17. Other types of remote access connections are also possible.

[0030] Each access by a remote user to the server 10 results in a “hit” of raw traffic data 11. The format used in storing each traffic data hit 11 and an example of a traffic data hit 11 are described below with reference to FIGS. 3A and 3B, respectively. The server 10 preferably stores each traffic data hit 11 in a log file 15, although a database 16 or other storage structure can be used.

[0031] To analyze the traffic data, the server 10 examines each traffic data hit 11 and stores the access information obtained from the traffic data as analysis results 18A-C. Five sources of traffic data 11 (remote system 12, dial-up connection 14, log file 15, database 16 and direct connection 17) are shown. Other sources are also possible. The traffic data hits 11 can originate from any single source or from a combination of these sources. While the server 10 receives traffic data hits 11 continuously, separate sets of analysis results 18A-C are stored for each discrete reporting period, called a time slice. The analysis results 18A-C are used for generating summaries 19A-C of the access information.

[0032] In the described embodiment, the server 10 is typically an Intel Pentium-based computer system equipped with a processor, memory, input/output interfaces, a network interface, a secondary storage device and a user interface, preferably a keyboard and display. The server 10 typically operates under the control of either the Microsoft Windows NT or Unix operating systems and executes either Microsoft Internet Information Server or Netscape Communications Server software. Pentium, Microsoft, Windows, Windows NT, Unix, Netscape and Netscape Communications Server are trademarks of their respective owners. However, other server 10 configurations varying in hardware, such as DOS-compatible, Apple Macintosh, Sun Workstation and other platforms, in operating systems, such as MS-DOS, Unix and others, and in web software are also possible. Apple, Macintosh, Sun and MS-DOS are trademarks of their respective owners.

[0033] FIG. 2 is a flow diagram of a method 20 for analyzing traffic data in a distributed computing environment according to the present invention using the system of FIG. 1. Its purpose is to continuously collect and summarize access information from traffic data hits 11 while allowing on-demand, ad hoc analyses. The method 20 consists of two routines. Access information is collected from traffic data hits 11 and summarized by the server 10 into analysis results 18A-C (block 21), as further described below with reference to FIG. 6. The access information is separately analyzed for generating the summaries 19A-C which identify trends, statistics and other information (block 22), as further described below with reference to FIGS. 9A and 9B. The collection and summarizing of the access information (block 21) is performed continuously by the server 10 while the analysis of the access information (block 22) is performed on an ad hoc basis by either the server 10 or a separate workstation (not shown).

[0034] The method 20 is preferably implemented as a computer program executed by the server 10 and embodied in a storage medium comprising computer-readable code. In the described embodiment, the method 20 is written in the C programming language, although other programming languages are equally suitable. It operates in a Microsoft Windows environment and can analyze Common Log File, Combined Log File and proprietary log file formats from industry standard web servers, such as those licensed by Netscape, NCSA, O'Reilly WebSite, Quarterdeck, C-Builder, Microsoft, Oracle, EMWAC, and other Windows 3.x, Windows NT 95, Unix and Macintosh Web servers. The analysis results 18A-C can be stored in a proprietary or standard database 16 (shown in FIG. 1), such as SQL, BTRIEVE, ORACLE, INFORMIX and others. The method 20 uses the analysis results 18A-C of traffic data hits 11 as collected into the log file 15 or database 16 for building activity, geographic, demographic and other summaries 19A-C, such as listed below in Table 1. Other summaries 19A-C are also possible.

TABLE 1

User Profile by Regions
General Statistics Table
Top Requested Pages
Least Requested Pages
Top Entry Pages
Top Exit Pages
Single Access Pages
Top Paths Through Site
Advertising Views
Advertising Clicks
Advertising Views and Clicks
Most Downloaded Files
Most Active Organizations
Most Active Countries
Activity Summary by Day of Week
Activity Summary by Day
Activity Summary by Hour of the Day
Activity Summary Level by Hours of the Day
Web Server Statistics and Analysis
Client Errors
Top Downloaded File Types and Sizes
Server Errors
Activity by Organization Type
Top Directories Accessed
Top Referring Sites
Top Referring URLs
Top Browsers
Netscape Browsers
Microsoft Explorer Browsers
Visiting Spiders
Top Platforms

[0035] In addition, the analysis results 18A-C can be used for automatically producing reports and summaries which include statistical information and graphs showing, by way of example, user activity by market, interest level in specific web pages or services, which products are most popular, whether a visitor has a local, national or international origin and similar information. In the described embodiment, the summaries 19A-C can be generated as reports in a variety of formats. These formats include hypertext markup language (HTML) files compatible with the majority of popular web browsers, proprietary file formats for use with word processing, spreadsheet, database and other programs, such as Microsoft Word, Microsoft Excel, ASCII files and various other formats. Word and Excel are trademarks of Microsoft Corporation, Redmond, Wash.

[0036] FIG. 3A shows a format used in storing a “hit” of raw traffic data 11 received by the server of FIG. 1. A raw traffic data hit 11 is not in the format shown in FIG. 3A. Rather, the contents of each field in the format are determined from the data packets exchanged between the server 10 and the source of the traffic data hit 11, and the information pulled from the data packets is stored into a data record using the format of FIG. 3A prior to being stored in the log file 15 (shown in FIG. 1) or processed.

[0037] Each traffic data hit 11 is a formatted string of ASCII data. The format is based on the standard log file format developed by the National Center for Supercomputing Applications (NCSA), the standard logging format used by most web servers. The format consists of seven fields as follows:

Field Name: Description
User Address (30): Internet protocol (IP) address or domain name of the user accessing the site.
RFC931 (31): Obsolete field usually left blank, but increasingly used by many web servers to store the host domain name for multi-homed log files.
User Authentication (32): Exchanges the user name if required for access to the web site.
Date/Time (33): Date and time of the access and the time offset from GMT.
Request (34): Either GET (a page request) or POST (a form submission) command.
Return Code (35): Return status of the request which specifies whether the transfer was successful.
Transfer Size (36): Number of bytes transferred for the file request, that is, the file size.

In addition, three optional fields can be employed as follows:

Referring Site (37): URL used to obtain web site information for performing the “hit”.
Agent (38): Browser version, including make, model or version number and operating system.
Cookie (39): Unique identifier permissively used to identify a particular user.

[0038] Other formats of traffic data hits 11 are also possible, including proprietary formats containing additional fields, such as time to transmit, type of service operation and others. Moreover, modifications and additions to the formats of raw traffic data hits 11 are constantly occurring and the extensions required by the present invention to handle such variations of the formats would be known to one skilled in the art.

[0039] FIG. 3B shows, by way of example, a “hit” of raw traffic data received by the server of FIG. 1. The User Address 30 field is “tarpon.gulf.net” indicating the user originates from a domain named “gulf.net” residing on a machine called “tarpon.” The RFC931 31 and User Authentication 32 fields are “−” indicating blank entries. The Date/Time 33 field is “Jan. 12, 1996 20:38:17 +0000” indicating an access on Jan. 12, 1996 at 8:38:17 pm GMT. The Request 34 field is “GET /general.htm HTTP/1.0” indicating the user requested the “general.htm” page. The Return Code 35 and Transfer Size 36 fields are 200 and 3599, respectively, indicating a successful transfer of 3599 bytes.
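By way of illustration only, the seven required fields could be extracted from a log line with a short parser such as the Python sketch below. The sketch assumes the conventional Common Log Format text layout (bracketed timestamp, quoted request, ASCII hyphen for blank fields), which may differ from the exact layout of FIGS. 3A and 3B. The names HIT_PATTERN, Hit, parse_hit and EXAMPLE are hypothetical and are not part of the described embodiment, which is written in C.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed Common Log Format layout; not taken from FIG. 3A itself.
HIT_PATTERN = re.compile(
    r'(?P<user_address>\S+) (?P<rfc931>\S+) (?P<user_auth>\S+) '
    r'\[(?P<date_time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<return_code>\d{3}) (?P<transfer_size>\d+|-)'
)


class Hit(NamedTuple):
    """One parsed traffic data hit (hypothetical record type)."""
    user_address: str
    rfc931: Optional[str]
    user_auth: Optional[str]
    date_time: datetime
    request: str
    return_code: int
    transfer_size: int


def parse_hit(line: str) -> Hit:
    """Split one log line into the seven required fields."""
    match = HIT_PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognized log line: " + line)

    def blank(value: str) -> Optional[str]:
        # A lone hyphen marks a blank entry.
        return None if value == "-" else value

    size = match["transfer_size"]
    return Hit(
        user_address=match["user_address"],
        rfc931=blank(match["rfc931"]),
        user_auth=blank(match["user_auth"]),
        date_time=datetime.strptime(match["date_time"], "%d/%b/%Y:%H:%M:%S %z"),
        request=match["request"],
        return_code=int(match["return_code"]),
        transfer_size=0 if size == "-" else int(size),
    )


# The example hit discussed above, written in the assumed text layout.
EXAMPLE = ('tarpon.gulf.net - - [12/Jan/1996:20:38:17 +0000] '
           '"GET /general.htm HTTP/1.0" 200 3599')
hit = parse_hit(EXAMPLE)
```

The optional Referring Site, Agent and Cookie fields are omitted from this sketch; they could be handled by extending the pattern with further optional groups.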

[0040] FIG. 4 is a block diagram of the data structures used in storing access information determined from the traffic data hits 11 of FIG. 3A. Users continuously access the server 10 during which time the server 10 receives a series of “hits” from remote users for exchanging information, such as accessing a web page or posting a file. Users are identified by the user's internet protocol (IP) address or domain name. The time during which the user is actively accessing the server 10 is known as a session. An open session is defined as a period of activity for one user of the server 10. By default, a user session is terminated when a user falls inactive for more than 30 minutes, although other time limits are equally suitable. An open user session can span two or more time slices which can artificially inflate the open session count during the analysis of the access information (block 22) as further described below with reference to FIG. 11.
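As an illustration of the default session rule, the following Python sketch counts user sessions from time-ordered hits and opens a new session whenever a user has been inactive for more than 30 minutes. The function name count_sessions, the (user, timestamp) tuple layout and the sample data are assumptions made for this example and do not appear in the described embodiment.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # default limit named in the text


def count_sessions(hits, timeout=SESSION_TIMEOUT):
    """Count sessions per user from (user, timestamp) pairs in time order."""
    last_seen = {}   # user -> timestamp of that user's most recent hit
    sessions = {}    # user -> number of sessions opened so far
    for user, timestamp in hits:
        previous = last_seen.get(user)
        # First hit, or a gap longer than the timeout, opens a new session.
        if previous is None or timestamp - previous > timeout:
            sessions[user] = sessions.get(user, 0) + 1
        last_seen[user] = timestamp
    return sessions


sample_hits = [
    ("tarpon.gulf.net", datetime(1996, 1, 12, 20, 0)),
    ("tarpon.gulf.net", datetime(1996, 1, 12, 20, 10)),  # same session
    ("other.example.org", datetime(1996, 1, 12, 20, 15)),
    ("tarpon.gulf.net", datetime(1996, 1, 12, 20, 50)),  # 40 minute gap
]
session_counts = count_sessions(sample_hits)
```

In this sample the first user is credited with two sessions, because the final hit arrives 40 minutes after the preceding one, and the second user with one.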

[0041] Each traffic data hit 11 is parsed to obtain pertinent access information. While a traffic data hit 11 mainly contains formatted data as described with reference to FIG. 3A, access information is broader and includes data derived from the context of the “hit,” such as the city or state of the referring site. In the described embodiment, a database of both U.S. and international Internet addresses (not shown), including full company name, city, state and country, is maintained for inferring such indirect access information about each user. The access information is then used to populate a set of results tables 40A-D. Each table stores a particular type of access information, such as the state, city or country of the user, the page within the web site being accessed, the source web site, a Uniform Resource Locator (URL) and other information either directly or inferentially derivable from the traffic data hit 11. At the end of the time slice, the results tables 40A-D are summarized into a container file 41, further described below with reference to FIG. 5, which is stored in the analysis results 18A-C.

[0042] The results tables 40A-D are categorized according to the type of access information being counted and each results table 40A contains a set of records 42 for storing the access information. In the described embodiment, there are two types of tables. Static tables contain a fixed and predefined set of records 42, such as the set of pages in the web site being measured. Dynamic tables are of an undetermined length and can have zero or more records. A new record 42 must be created in a dynamic results table 40A each time new access information is encountered.

[0043] For example, in a dynamic results table 40A for storing the state from which the user originates, a record might contain “TX: 5, 500” indicating the user's state is Texas with five user sessions and 500 hits recorded so far. If the next traffic data hit 11 originates from a new user from Texas, this record 42 will be updated to “TX: 6, 501” indicating six user sessions with 501 hits. If the next traffic data hit 11 originates from yet another new user from California, a new record 42 will be created containing “CA: 1, 1” indicating the user's state is California with one user session and one hit. In addition to the set of results tables 40A-D, the server 10 maintains a user session table 43 for tracking the open user sessions during each time slice which is used in a further embodiment described below with reference to FIGS. 12-13.
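The Texas and California example can be reproduced with a small Python sketch of a dynamic results table. The class names Record and DynamicResultsTable and the method add_hit are hypothetical; the sketch assumes each record holds only a session count and a hit count, as in the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One record of a results table: session and hit counts for a key."""
    sessions: int = 0
    hits: int = 0


class DynamicResultsTable:
    """Dynamic table: records are created the first time a key is seen."""

    def __init__(self):
        self.records = {}

    def add_hit(self, key, new_session):
        # Create the record on first sight of the key, then update counts.
        record = self.records.setdefault(key, Record())
        if new_session:
            record.sessions += 1
        record.hits += 1
        return record


states = DynamicResultsTable()
states.records["TX"] = Record(sessions=5, hits=500)  # "TX: 5, 500"
states.add_hit("TX", new_session=True)               # new user from Texas
states.add_hit("CA", new_session=True)               # new user from California
```

After the two calls, the Texas record holds six sessions and 501 hits and a California record exists with one session and one hit, matching the values given in the text.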

[0044] FIG. 5 is a block diagram of a container file 41 storing the access information in the analysis results 18A-C of FIG. 1. Each container file 41 contains a table of contents 44 mapping out the relative locations of each results table 40A-D within the container file 41. The user session table 43 is also stored in the container file 41 and contains a series of pointers to a set of microtables 45A-C. Each microtable 45A-C corresponds to one of the results tables 40A-D potentially containing an inflated count of open sessions. Each entry in a microtable 45A contains an index 46 pointing to a record within its associated results table 40B which requires adjustment for inflation. However, not every results table 40A-D has an associated microtable 45A-C. Rather, the total number of microtables 45A-C is less than or equal to the number of results tables 40A-D since not every results table 40A-D contains inflated information.

[0045] For example, the state from which a user originates is counted once during each session. Since it is only counted once, the number of open user sessions for any given state is not inflated. Consequently, no microtable 45 is needed for the results table 40A for states. Conversely, a page in a web site can be accessed numerous times during an open session. Thus, a microtable 45A is required. A count of the number of open sessions spanning each time slice boundary is made in the user session table 43, as described below with reference to FIG. 12, and an entry is made in the user session table 43 pointing to a corresponding microtable 45A. In turn, each entry within the microtable 45A contains an index to a particular record within the results table 40B for web pages. During analysis, the access information is adjusted to remove the inflation as described below with reference to FIG. 13.

[0046] FIG. 6 is a flow diagram of a routine for collecting and summarizing access information (block 21) used in the method of FIG. 2. Its purpose is to iteratively process traffic data hits 11 during the current time slice and to thereafter summarize the results. The access information is not adjusted for inflation due to the double, triple or multiple counting of open sessions spanning multiple time slices. Inflation adjustment is unnecessary if the access information being summarized is counted just once. However, a further embodiment of the present method is described below with reference to FIGS. 11 and 12 for adjusting the analysis results for inflation where such adjustment is needed.

[0047] The routine is executed by the server 10 once during each time slice. First, the static results tables 40A-D, if any, are initialized (block 50). The routine then enters a loop (blocks 51-54) for continuously handling a stream of traffic data hits 11. A “hit” of raw traffic data 11 is received (block 51) in the log file format described with reference to FIG. 3A. In the described embodiment, 99% of the traffic data hits 11 are received from the log file 15 (shown in FIG. 1), although the traffic data hits 11 could also be received from other sources. Next, the raw traffic data 11 is parsed for access information (block 52). Access information includes but is not limited to the contents of the fields of the log file format described with reference to FIG. 3A. In addition, the access information includes contextual information derived from the hit, such as the particular web page accessed, the day of the week, the hour of the day and so forth. The access information is stored into the pertinent results table 40A-D (block 53) as further described below with reference to FIG. 7. If the current time slice has not yet ended (block 54), processing continues with the next traffic data hit 11 at the top of the processing loop (blocks 51-54). Otherwise, if the time slice has ended (block 54), the access information is summarized into a container file 41 (block 55), as further described below with reference to FIG. 8, and the routine returns.
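The per-time-slice collection loop can be pictured with the following simplified Python sketch. It is an assumption-laden illustration rather than the routine of FIG. 6: hits are taken to be already parsed into (timestamp, page) pairs in time order, none earlier than the start time, a single page-count table stands in for the results tables 40A-D, and the function name collect_by_time_slice is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def collect_by_time_slice(hits, start, slice_length):
    """Group time-ordered (timestamp, page) hits into per-slice tables.

    Returns a list of (slice_start, Counter) pairs, one per time slice,
    including empty slices that fall between hits.
    """
    summaries = []
    slice_start = start
    table = Counter()                  # fresh table for the current slice
    for timestamp, page in hits:
        # Close every slice that ended before this hit arrived.
        while timestamp >= slice_start + slice_length:
            summaries.append((slice_start, table))
            slice_start += slice_length
            table = Counter()
        table[page] += 1               # store the access information
    summaries.append((slice_start, table))  # close the final slice
    return summaries


day = timedelta(days=1)
sample = [
    (datetime(1996, 1, 12, 20, 38), "/general.htm"),
    (datetime(1996, 1, 12, 21, 0), "/index.htm"),
    (datetime(1996, 1, 14, 9, 0), "/general.htm"),
]
summaries = collect_by_time_slice(sample, datetime(1996, 1, 12), day)
```

With daily time slices, the sample produces three summaries: two hits on January 12, an empty table for January 13 and one hit on January 14.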

[0048] FIG. 7 is a flow diagram of a routine for storing the access information (block 53) used in the routine of FIG. 6. Its purpose is to iteratively populate each of the results tables 40A-D with the access information parsed and inferred from each traffic data hit 11. The access information is categorized according to the results tables 40A-D. The routine enters a processing loop (blocks 60-65) for continuously populating a results table 40A with access information, if appropriate. Thus, a pertinent results table 40A is located (block 60). If the results table 40A is not static (block 61) and a record for storing this type of access information does not exist in this results table 40A (block 62), a record is created (block 63). Otherwise, if the results table 40A is static (block 61) or if the results table 40A is dynamic yet a record for storing this type of access information already exists (block 62), the access information is stored into the record for storing this type of access information in the results table 40A (block 64). If all the access information for the current traffic data hit 11 has not been stored into a results table 40A (block 65), processing continues at the top of the processing loop (blocks 60-65). Otherwise, if all access information has been stored (block 65), the routine returns.
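A minimal Python sketch of this storing loop follows. Several points are assumptions rather than statements of the described embodiment: each record is reduced to a single hit count, a newly created record in a dynamic table immediately receives the current access information (consistent with the earlier “CA: 1, 1” example), and a static table silently ignores a key it was not initialized with. The function name store_access_info and the table names are hypothetical.

```python
def store_access_info(tables, static_names, access_info):
    """Store one hit's access information into every pertinent table.

    tables:       dict of table name -> dict of key -> hit count
    static_names: set of table names whose record set is fixed
    access_info:  dict of table name -> key derived from the current hit
    """
    for table_name, key in access_info.items():   # loop of blocks 60-65
        table = tables[table_name]                # locate table (block 60)
        if key not in table:                      # record missing (block 62)
            if table_name in static_names:        # static table (block 61)
                continue  # assumption: unknown keys are ignored
            table[key] = 0                        # create record (block 63)
        table[key] += 1                           # store info (block 64)


tables = {
    "pages": {"/index.htm": 0, "/general.htm": 0},  # static, pre-initialized
    "states": {},                                   # dynamic, starts empty
}
static_names = {"pages"}
store_access_info(tables, static_names, {"pages": "/general.htm", "states": "TX"})
store_access_info(tables, static_names, {"pages": "/missing.htm", "states": "TX"})
```

After both calls the dynamic states table holds a Texas record with a count of two, while the static pages table still contains only its two predefined records.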

[0049] FIG. 8 is a flow diagram of a routine for summarizing access information (block 55) used in the routine of FIG. 6. Its purpose is to iteratively summarize each of the results tables 40A-D into a container file 41 stored with the analysis results 18A-C (shown in FIG. 1). The routine enters a processing loop (blocks 70-72) for continuously summarizing each results table 40A. Thus, a results table 40A is obtained (block 70). The results table 40A is stored into a container file 41 (block 71) by copying the results table 40A into the container file 41 and updating the table of contents 44 of the container file 41 to reflect the relative position of the results table 40A within the container file 41. If all of the results tables 40A-D have not been summarized (block 72), processing continues at the top of the processing loop (blocks 70-72). Otherwise, if all of the results tables 40A-D have been summarized (block 72), the routine returns.
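The idea of copying each results table into a container and recording its relative position in a table of contents can be sketched in Python as follows. The on-disk layout of the container file 41 is not specified here, so the sketch assumes an in-memory byte string, JSON-encoded tables and a table of contents mapping each table name to an (offset, length) pair. The function names build_container and read_table are hypothetical.

```python
import json


def build_container(results_tables):
    """Copy each results table into one byte string and index its position.

    Returns (toc, body), where toc maps table name -> (offset, length).
    """
    toc = {}
    body = bytearray()
    for name, table in results_tables.items():
        blob = json.dumps(table, sort_keys=True).encode("utf-8")
        toc[name] = (len(body), len(blob))  # relative position in container
        body += blob                        # copy the table into container
    return toc, bytes(body)


def read_table(toc, body, name):
    """Fetch a single results table using only the table of contents."""
    offset, length = toc[name]
    return json.loads(body[offset:offset + length].decode("utf-8"))


results_tables = {
    "states": {"TX": [6, 501], "CA": [1, 1]},  # [sessions, hits]
    "pages": {"/general.htm": [3, 10]},
}
toc, body = build_container(results_tables)
pages = read_table(toc, body, "pages")
```

Because the table of contents stores relative positions, one table can be read back without decoding the others, which is the property the container file relies on.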

[0050] In the two preceding routines for storing and summarizing access information, described with reference to FIGS. 7 and 8, respectively, an iterative loop (blocks 60-65 in FIG. 7 and blocks 70-72 in FIG. 8) was employed for sequentially processing each of the results tables 40A-D. However, a further embodiment of the present invention uses a selection statement instead of a looping construct to directly access each results table 40A.

[0051] FIGS. 9A and 9B and FIG. 10 are flow diagrams respectively of one-pass and two-pass routines for analyzing access information used in the method of FIG. 2. The one-pass routine (FIGS. 9A and 9B) minimizes the number of data accesses performed in analyzing the access information. The two-pass routine (FIG. 10) minimizes the number of program variables required. Either routine is equally suitable for analyzing the access information depending upon the particular configuration of the server 10 or workstation (not shown) used to perform the analysis.

[0052] FIGS. 9A and 9B are the flow diagram of a one-pass routine for analyzing access information (block 22) used in the method of FIG. 2. Its purpose is to analyze and summarize the access information recorded for a user-requested time frame on an ad hoc basis in a single pass through the analysis results 18A-C. The time frame can be smaller than, equal to or larger than the time slice used by the access information collection and summarization routine (block 21 in FIG. 2). The routine automatically divides the requested time frame into smaller time slices and stores the analysis summaries for each of the time slices to allow greater flexibility and speed in subsequent reports of the same or related time frame.

[0053] Briefly, the routine creates a container file 41 for storing summarized access information for the requested time frame if such a container file 41 does not already exist in the analysis results 18A-C (shown in FIG. 1). The new container file 41 is maintained in the analysis results 18A-C for immediate access in subsequent requests and re-analysis of the time slices is avoided.

[0054] The routine is hierarchically structured according to increasing processing demand based on the availability of summarized access information in the analysis results 18A-C. At the bottom of the hierarchy (blocks 81-82), the routine uses any available analysis results 18A-C stored in a container file 41. At the next level of the hierarchy (blocks 83-85), the routine summarizes collected but unsummarized access information. At the top of the hierarchy (blocks 86-87), the routine collects and summarizes raw traffic data hits 11. This hierarchical structuring enables the server 10 to efficiently analyze the traffic data by utilizing existing summaries 19A-C whenever possible and thereby avoid the need to process raw traffic data 11 for each time slice in the time frame every time a new analysis request is made.

[0055] In the routine, the time frame of interest is defined (block 80). If any analysis summaries for the requested time frame already exist in a container file 41 stored in the analysis results 18A-C (block 81), the available analysis summaries are summarized (block 82). This step is skipped if no analysis summaries already exist (block 81). Next, if any analysis summaries are missing (block 83), the next stage in the hierarchy is performed. Specifically, if any unsummarized analysis results for the time frame already exist (block 84), the access information for each time slice in the requested time frame for the unsummarized analysis results is summarized (block 55), as described above with reference to FIG. 8. These analysis results are then added to the summary (block 85). However, these last two steps are skipped if no unsummarized analysis results for the time frame already exist (block 84). If analysis summaries are still missing (block 86), the last stage in the hierarchy is performed. Specifically, access information for each time slice in the requested time frame for the remaining missing analysis results is collected and summarized (block 21), as described above with reference to FIG. 6. These analysis results are then added to the summary (block 87). Once no further analysis results are missing (blocks 83 and 86), the analysis of the requested time frame is complete (block 88) and the routine returns.
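The hierarchy of fallbacks can be illustrated with the simplified Python sketch below. It departs from the flow diagram in that it resolves each time slice in turn rather than in three separate stages, and it assumes three in-memory stores keyed by time slice: existing summaries, unsummarized results and raw hits reduced to lists of page names. The function name analyze and the store layouts are hypothetical.

```python
from collections import Counter


def analyze(time_frame, summaries, unsummarized, raw_hits):
    """Total page counts over a time frame, reusing the cheapest source.

    time_frame:   list of time slice identifiers
    summaries:    dict of slice -> Counter (cheapest, reused as is)
    unsummarized: dict of slice -> dict of counts (needs summarizing)
    raw_hits:     dict of slice -> list of pages (most expensive)
    """
    total = Counter()
    for time_slice in time_frame:
        if time_slice not in summaries:          # summary is missing
            if time_slice in unsummarized:       # collected, unsummarized
                summaries[time_slice] = Counter(unsummarized.pop(time_slice))
            else:                                # fall back to raw hits
                summaries[time_slice] = Counter(raw_hits.get(time_slice, []))
        total += summaries[time_slice]           # add to the summary
    return total


summaries = {"d1": Counter({"/a": 2})}
unsummarized = {"d2": {"/a": 1, "/b": 1}}
raw_hits = {"d3": ["/b", "/b"]}
first = analyze(["d1", "d2", "d3"], summaries, unsummarized, raw_hits)
second = analyze(["d1", "d2", "d3"], summaries, unsummarized, raw_hits)
```

The second call finds a stored summary for every time slice, so neither the unsummarized results nor the raw hits are consulted again, which mirrors the reuse of stored summaries described above.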

[0056] FIG. 10 is a flow diagram of a two-pass routine for analyzing access information used in the method of FIG. 2. Its purpose is to analyze and summarize the access information recorded for a user-requested time frame on an ad hoc basis in two passes through the analysis results 18A-C. The first pass (blocks 121-21) “inventories” available analysis results 18A-C and creates any missing analysis summaries as needed. The second pass (block 125) collects and completes the analysis.

[0057] In the routine, the time frame of interest is defined (block 120). The analysis summaries for the requested time frame already existing in a container file 41 stored in the analysis results 18A-C are inventoried for determining any gaps in the data (block 121). If any analysis summaries are missing (block 122), the next stage in the hierarchy is performed. Specifically, if any unsummarized analysis results for the time frame already exist (block 123), the access information for each time slice in the requested time frame for the unsummarized analysis results is summarized (block 55), as described above with reference to FIG. 8. However, this step is skipped if no unsummarized analysis results for the time frame already exist (block 123). If analysis summaries are still missing (block 124), the last stage in the hierarchy is performed. Specifically, access information for each time slice in the requested time frame for the remaining missing analysis results is collected and summarized (block 21), as described above with reference to FIG. 6. The analysis of the requested time frame is then completed (block 125) and the routine returns.
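The two-pass variant can be sketched in the same spirit. This is an illustrative sketch only: the names fill_gaps and analyze_two_pass are hypothetical, summaries are assumed to be per-item counts, and summarizing is reduced to counting items. The first function creates any missing summaries; the second combines all summaries in a single final step.

```python
from collections import Counter

def fill_gaps(time_slices, summaries, unsummarized, raw_hits):
    """First pass: inventory existing summaries and create missing ones."""
    for ts in time_slices:
        if ts in summaries:
            continue  # block 121: summary already available
        if ts in unsummarized:
            # Blocks 123 and 55: summarize collected access information.
            summaries[ts] = Counter(unsummarized[ts])
        else:
            # Blocks 124 and 21: collect and summarize raw hits.
            summaries[ts] = Counter(raw_hits.get(ts, []))
    return summaries

def analyze_two_pass(time_slices, summaries, unsummarized, raw_hits):
    """Second pass (block 125): combine the now-complete set of summaries."""
    fill_gaps(time_slices, summaries, unsummarized, raw_hits)
    total = Counter()
    for ts in time_slices:
        total.update(summaries[ts])
    return total
```

Unlike the one-pass sketch, nothing is added to the running total until every gap has been filled, so the combination step always works from a complete inventory.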

[0058] FIG. 11 is a graph of the number of open sessions as a function of time received by the server of FIG. 1. As explained above, the method described with reference to FIGS. 6-9 assumes that the access information is not inflated by double, triple or multiple counting of open sessions spanning multiple time slices. This type of adjustment is unnecessary where the access information is counted only once during the entire user session. However, many types of traffic data hits 11, such as web page accesses, can result in multiple counting. In the graph shown in FIG. 11, the number of open sessions 90 is tallied as a function of time. Each new traffic data hit 11 causes an additional open session to be counted. The boundary between two time slices 91 straddles a “bump” 92 of multiply-counted open sessions which inflates the number of open sessions 90 counted. The “bump” 92 occurs because each open session is in effect counted twice, thrice or multiple times in the results tables 40A-D for each respective time slice. The net result is an inflated figure for the number of open sessions during which the item of interest was accessed.

[0059] For example, assume the server 10 saves analysis results 18A once for each 24-hour time slice starting at 00:00:00 and ending at 23:59:59. Users that visit the server from, for instance, 23:50:00 until 00:30:00 will be registered twice: once in the analysis results 18A for the first time slice and once in the analysis results for the second time slice. Thus, suppose the item of interest is the number of open sessions during which a particular web page was accessed and the time frame of interest was just the first and second time slices. Each new traffic data hit 11 for this web page requested by a user with an open session falling between 23:50:00 and 00:30:00 will result in a double-count for the second time slice if that user already accessed this web page during the first time slice. The summary of the time frame of the first and second time slices will be inflated unless the double-counts are subtracted from the number of open sessions for this web page for the second time slice.
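The inflation in this example can be reproduced with a small worked sketch. The dates and the helper name sessions_per_slice are arbitrary assumptions chosen for illustration; a session is modeled simply as a (start, end) pair, and a session is counted in every 24-hour time slice during which it is open.

```python
from datetime import datetime, timedelta

def sessions_per_slice(sessions, slice_starts, slice_length):
    """Count, for each time slice, the sessions open at any point during it."""
    counts = []
    for start in slice_starts:
        end = start + slice_length
        counts.append(sum(1 for s_start, s_end in sessions
                          if s_start < end and s_end >= start))
    return counts

DAY = timedelta(days=1)
day1 = datetime(1997, 1, 1)  # arbitrary illustrative date
day2 = day1 + DAY

sessions = [
    # Session spanning midnight, 23:50:00 to 00:30:00.
    (datetime(1997, 1, 1, 23, 50), datetime(1997, 1, 2, 0, 30)),
    # Session contained entirely in the first time slice.
    (datetime(1997, 1, 1, 10, 0), datetime(1997, 1, 1, 10, 20)),
]

per_slice = sessions_per_slice(sessions, [day1, day2], DAY)
naive_total = sum(per_slice)   # adds the per-slice counts directly
actual_total = len(sessions)   # each session counted once
```

Here the per-slice counts are 2 and 1, so adding them gives 3 although only 2 sessions occurred; the difference of 1 is the session that straddles the slice boundary.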

[0060] To resolve this problem, a further embodiment of the present invention introduces additional steps into the method described with reference to FIGS. 6-9 to “remember” and store with each analysis summary the number of open sessions remaining at the end of the time slice. This allows the method to count those open sessions spanning two or more time slices and deinflate the analysis summaries accordingly.

[0061] For example, if a user is visiting the server 10 from Day X at 23:50:00 to Day X+1 at 00:30:00, the server 10 will store the user identifier, such as the user's name, IP address, cookie or other indication, with the analysis summary of Day X. Later, when the analysis summaries for Day X and Day X+1 are combined, the number of open sessions can be adjusted to compensate for any multiple counting.

[0062] The additional steps are introduced into both the routine for collecting and summarizing access information (block 21 in FIG. 2) for “remembering” any multiple counts and the routine for analyzing the access information (block 22 in FIG. 2) for adjusting the open session counts during analysis. FIG. 12 is a flow diagram of steps for adjusting the collection of access information for inflation used in the routine of FIG. 6 which are inserted after the step of summarizing the access information (block 55). Thus, if there are any open sessions remaining at the end of the time slice (block 101), the number of open sessions is stored with the analysis results and the user session table 43 is updated with the relative location of each of the associated microtables 45A-D in the container file 41 (block 102). Otherwise, if no open sessions exist (block 101), no further steps need be taken.
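A minimal sketch of the “remembering” step follows. It does not model the container file 41, the user session table 43 or the microtables; instead it assumes, purely for illustration, that a summary is a dictionary holding per-page session counts plus, for each page, the identifiers of users whose sessions are still open at the end of the time slice. The function name and data layout are hypothetical.

```python
from collections import Counter

def summarize_with_open_sessions(hits, open_at_end):
    """Summarize one time slice and remember sessions still open at its end.

    hits:        iterable of (user_id, page) pairs for the time slice
    open_at_end: set of user ids whose sessions remain open (block 101)
    """
    seen = set()
    sessions_per_page = Counter()
    for user_id, page in hits:
        # Count each user's session at most once per page within the slice.
        if (user_id, page) not in seen:
            seen.add((user_id, page))
            sessions_per_page[page] += 1

    # Block 102 (simplified): store the open sessions with the results.
    carried = {}
    for user_id, page in seen:
        if user_id in open_at_end:
            carried.setdefault(page, set()).add(user_id)

    return {"sessions": sessions_per_page, "open": carried}
```

Storing identifiers rather than a bare count is one possible design choice; it lets a later analysis step check whether the same user reappears for the same page in the following time slice.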

[0063] FIG. 13 is a flow diagram of steps for adjusting the analysis of access information for inflation used in the routines of FIGS. 9A, 9B and 10 which are inserted after each step during which the summary of analysis results is updated (blocks 82, 85 and 87 in FIGS. 9A and 9B and block 127 in FIG. 10). Thus, the time slice in the requested time frame is selected (block 111). If this is not the last time slice in the requested time frame (block 112), the number of open sessions for the prior time slice is deducted from the analysis results for the current time slice (block 112), thereby deinflating the count and processing continues with the next time slice in the requested time frame (block 111). Otherwise, if this is the last time slice in the requested time frame (block 112), processing is complete.
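The deduction can be sketched as a simple loop over consecutive time slices. This is a simplified illustration, not the routine of FIG. 13 itself: the name combine_deinflated and the dictionary keys are hypothetical, and the sketch assumes that every session left open at the end of one time slice was counted again in the next.

```python
def combine_deinflated(slices):
    """Combine per-slice session counts, deducting carried-over sessions.

    slices: list of dicts ordered by time, each with
            'sessions'    - sessions counted during the slice
            'open_at_end' - sessions still open when the slice ended
    """
    total = 0
    carried = 0  # open sessions remembered from the prior time slice
    for s in slices:
        # Deduct sessions already counted in the prior slice.
        total += s["sessions"] - carried
        carried = s["open_at_end"]
    return total
```

For instance, with 10 sessions in the first slice (3 still open at its end) and 8 sessions in the second, the combined figure under these assumptions is 10 + (8 - 3) = 15 rather than 18.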

[0064] In the described embodiment, the number of open sessions corresponding to certain types of data values collected for use in the summaries 19A-C (listed in Table 1) are counted just once. These are the data types which are generally not likely to change and include, for example, the referring web site, city, state, country, day of the week, region, organization type, browser and operating system type. No microtables 45A-C are needed for adjusting the open session counts corresponding to these data types. However, the number of open sessions corresponding to all other types of data values are counted continuously throughout the user session. Microtables 45A-C are required for these data types.

[0065] Session counts are maintained for each of the summaries 19A-C regardless of the data type, although the session counts are not necessarily used during the analysis of the access information (block 22) to deinflate the corresponding results tables 40A-D. Also, no microtables 45A-C are maintained for these non-adjusted results tables 40A-D. However, converting non-adjusted results tables 40A-D to adjusted results tables 40A-D merely requires forming an associated microtable 45A. This conversion would be necessary where, for instance, a data type formerly counted once per session is modified to allow for continuous counting.

[0066] Having described and illustrated the principles of the invention in a preferred embodiment thereof, it should be apparent that the invention can be modified in arrangement and detail without departing from such principles. We claim all modifications and variations coming within the spirit and scope of the following claims.

1. A system for analyzing traffic data in a distributed computing environment, the distributed computing environment comprising a plurality of interconnected systems operatively coupled to a server, the server configured to exchange data packets with each interconnected system, comprising: a source of traffic data hits, each traffic data hit corresponding to a data packet exchanged between the server and one such interconnected system; one or more results tables categorized by an associated data type, each results table comprising a plurality of records; means for collecting each traffic data hit from the traffic data hits source as access information into one such record in at least one results table according to the data type associated with the one such results table, each of the records in the results table corresponding to a different type of access information for the data type associated with the results table; means for summarizing periodically the access information collected into the results tables during a time slice into analysis results, the time slice corresponding to a discrete reporting period; and means for analyzing the access information from the results tables in the analysis results to form analysis summaries according to the data types associated with the results tables.
2. A system according to claim 1, wherein each such interconnected system interfaces to the server via one of a network connection, a point-to-point connection and a dedicated connection.
3. A system according to claim 1, wherein the server further comprises a log file operatively coupled to the server and storing the traffic data hits, the log file operating as the source of traffic data hits responsive to the collecting means.
4. A system according to claim 1, wherein the server further comprises a database operatively coupled to the server and storing at least one of the traffic data hits and the analysis results, the database operating as the source of traffic data hits responsive to the collecting means.
5. A system according to claim 1, further comprising: a user session table comprising one or more records which each store a pointer, each pointer corresponding to one of the results tables, the collecting means including a user session count for each such data type associated with each such results table, the user session count being stored in the user session table in each of the records; and one or more microtables, each of the microtables including one or more indices and being associated with one of the results tables, each such index within the microtable logically referring to each such different type of access information collected in the associated results table, each such pointer in the user session table further logically referring to one of the microtables, the analyzing means further comprising means for adjusting the user session count for consecutive time slices.
6. A system according to claim 5, further comprising: a container file comprising a table of contents and configured to store the one or more results tables, the user sessions table and the one or more microtables, the summarizing means further comprising means for mapping relative positions of each such results table within the container file into the table of contents and storing each such pointer in the user session table with the relative positions of each such microtable within the container file.
7. A method for analyzing traffic data in a distributed computing environment, the distributed computing environment comprising a plurality of interconnected systems operatively coupled to a server, a source of traffic data hits and one or more results tables categorized by an associated data type, each results table comprising a plurality of records, the server configured to exchange data packets with each interconnected system, each traffic data hit corresponding to a data packet exchanged between the server and one such interconnected system, the method comprising the steps of: collecting each traffic data hit from the traffic data hits source as access information into one such record in at least one results table according to the data type associated with the one such results table, each of the records in the results table corresponding to a different type of access information for the data type associated with the results table; summarizing periodically the access information collected into the results tables during a time slice into analysis results, the time slice corresponding to a discrete reporting period; and analyzing the access information from the results tables in the analysis results to form analysis summaries according to the data types associated with the results tables.
8. A method according to claim 7, further comprising the step of interfacing each such interconnected system to the server via one of a network connection, a point-to-point connection and a dedicated connection.
9. A method according to claim 7, further comprising the steps of: operatively coupling the server to a log file; and storing the traffic data hits from the traffic data hits source into the log file, the log file operating as the source of traffic data hits in the step of collecting each traffic data hit.
10. A method according to claim 7, further comprising the steps of: operatively coupling the server to a database; and storing at least one of the traffic data hits from the traffic hits source and the analysis results into the database, the database operating as the source of traffic data hits in the step of collecting each traffic data hit.
11. A method according to claim 7, wherein the distributed computing environment further comprises a user session table comprising one or more records which each store a pointer and one or more microtables, each pointer corresponding to one of the results tables, each of the microtables including one or more indices and being associated with one of the results tables, each such index within the microtable logically referring to each such different type of access information collected in the associated results table, each such pointer in the user session table further logically referring to one of the microtables, the method further comprising the steps of: counting a user session for each such data type associated with each such results table; storing the user session count stored in the user session table in each of the records; and adjusting the user session count for consecutive time slices.
12. A method according to claim 11, the distributed computing environment further comprising a container file comprising a table of contents and configured to store the one or more results tables, the user sessions table and the one or more microtables, the method further comprising the steps of: mapping relative positions of each such results table within the container file into the table of contents; and storing each such pointer in the user session table with the relative positions of each such microtable within the container file.
13. A method according to claim 7, the step of analyzing the access information further comprising the steps of: defining a time frame comprising a discrete period of time; and analyzing the analysis results for each such time slice occurring within the time frame based on availability of access information in the analysis results.
14. A method according to claim 13, the step of analyzing the analysis results comprising one pass and further comprising the steps of: summarizing the access information available as analysis summaries in the analysis results; performing the step of summarizing periodically the access information for each such time slice occurring within the time frame whereby analysis summaries are not available but access information from the results table is available in the analysis results; summarizing the analysis summaries formed in the preceding step; performing the steps of collecting each traffic data hit and summarizing periodically the access information for each such time slice occurring in the time frame whereby analysis summaries and access information from the results table are not available in the analysis results; and summarizing the analysis summaries formed in the preceding step.
15. A method according to claim 13, the step of analyzing the analysis results comprising two passes and further comprising the steps of: performing the step of summarizing periodically the access information for each such time slice occurring within the time frame whereby analysis summaries are not available but access information from the results table is available in the analysis results; performing the steps of collecting each traffic data hit and summarizing periodically the access information for each such time slice occurring in the time frame whereby analysis summaries and access information from the results table are not available in the analysis results; and summarizing the access information available as analysis summaries in the analysis results.
16. A storage medium embodying computer-readable code for analyzing traffic data in a distributed computing environment, the distributed computing environment comprising a plurality of interconnected systems operatively coupled to a server, a source of traffic data hits and one or more results tables categorized by an associated data type, each results table comprising a plurality of records, the server configured to exchange data packets with each interconnected system, each traffic data hit corresponding to a data packet exchanged between the server and one such interconnected system, comprising: means for collecting each traffic data hit from the traffic data hits source as access information into one such record in at least one results table according to the data type associated with the one such results table, each of the records in the results table corresponding to a different type of access information for the data type associated with the results table; means for summarizing periodically the access information collected into the results tables during a time slice into analysis results, the time slice corresponding to a discrete reporting period; and means for analyzing the access information from the results tables in the analysis results to form analysis summaries according to the data types associated with the results tables.
17. A storage medium according to claim 16, wherein the distributed computing environment further comprises a user session table comprising one or more records which each store a pointer and one or more microtables, each pointer corresponding to one of the results tables, each of the microtables including one or more indices and being associated with one of the results tables, each such index within the microtable logically referring to each such different type of access information collected in the associated results table, each such pointer in the user session table further logically referring to one of the microtables, further comprising: means for counting a user session for each such data type associated with each such results table; means for storing the user session count stored in the user session table in each of the records; and means for adjusting the user session count for consecutive time slices.
18. A storage medium according to claim 16, the means for analyzing the access information further comprising: means for defining a time frame comprising a discrete period of time; and means for analyzing the analysis results for each such time slice occurring within the time frame based on availability of access information in the analysis results.
19. A storage medium according to claim 16, the means for analyzing the analysis results comprising one pass and further comprising: means for summarizing the access information available as analysis summaries in the analysis results; means for performing the step of summarizing periodically the access information for each such time slice occurring within the time frame whereby analysis summaries are not available but access information from the results table is available in the analysis results; means for summarizing the analysis summaries formed in the preceding step; means for performing the steps of collecting each traffic data hit and summarizing periodically the access information for each such time slice occurring in the time frame whereby analysis summaries and access information from the results table are not available in the analysis results; and means for summarizing the analysis summaries formed in the preceding step.
20. A storage medium according to claim 16, the means for analyzing the analysis results comprising two passes and further comprising: means for performing the step of summarizing periodically the access information for each such time slice occurring within the time frame whereby analysis summaries are not available but access information from the results table is available in the analysis results; means for performing the steps of collecting each traffic data hit and summarizing periodically the access information for each such time slice occurring in the time frame whereby analysis summaries and access information from the results table are not available in the analysis results; and means for summarizing the access information available as analysis summaries in the analysis results.